In this R markdown we analyze the Traumabase Registry to estimate the treatment effect of the drug Tranexamic Acid (TXA) on mortality among patients with traumatic brain injury (TBI).
For this analysis, an expert committee has identified 17 confounders available in the Traumabase: Trauma.center, Cardiac.arrest.ph, SBP.ph.min, DBP.ph.min, HR.ph.max, SBP.ph, DBP.ph, HR.ph, Shock.index.ph (HR.ph/SBP.ph), Cristalloid.volume, Colloid.volume, SpO2.ph.min, Vasopressor.therapy, HemoCue.init, Delta.hemoCue, AIS.external, Activation.HS.procedure.
library(cobalt)
library(ggplot2)
library(dplyr)
library(reshape2)
library(tableone)
library(MASS)
library(pracma)
library(assertthat)
library(caret) #
library(ranger)# Random Forests prediction
library(FactoMineR) # Factorial data analysis
library(grf) # Doubly robust treatment effect estimation
library(norm)
library(missMDA) # PCA/MCA with missing values + iterative PC imputation
library(mice) # Multiple imputation
library(VIM) # Missing values exploration and visualization
# Define data directory
data.dir <- "~/Documents/TraumaMatrix/CausalInference/AcideTranexamique/Data/"
rdata.dir <- "~/Documents/TraumaMatrix/CausalInference/Simulations/causal-inference-missing/TranexamicAcid/RData/"
fig.dir <- "~/Documents/TraumaMatrix/CausalInference/Simulations/causal-inference-missing/TranexamicAcid/Figures/"
library(devtools)
# Load the ATE estimation functions `ipw` and `dr` from the GitHub repository
source_url("https://raw.githubusercontent.com/imkemayer/causal-inference-missing/master/Helper/helper_causalInference.R")
## SHA-1 hash of file is 72aa0bae8732bb83bbb18ae5189baae2b7171c71
source_url("https://raw.githubusercontent.com/imkemayer/causal-inference-missing/master/Helper/helper_imputation.R")
## SHA-1 hash of file is 66fca1a6480553b69007d1f9896cd3097606fb98
save_results <- FALSE
tau <- NULL
# Set random generator seed for reproducible results
set.seed(0)
Run R markdown preprocess_ate_analysis_traumabase_example.Rmd prior to this to pre-process and save the data for the following analyses.
data_indTBI <- read.csv(paste0(data.dir,
"ate_analysis_traumabase_large_data_preprocessed_tbi_individuals.csv"),
row.names = 1)
seed <- 1234
synthetic <- FALSE
X.na <- data_indTBI[,setdiff(colnames(data_indTBI),
c("TBI", "Tranexamic.acid", "Death"))]
W <- data_indTBI$Tranexamic.acid
Y <- data_indTBI$Death
covariate_names <- colnames(X.na)
n <- dim(X.na)[1]
p <- dim(X.na)[2]
df.na <- data.frame(cbind(X.na, W=W, Y=Y))
colnames(df.na) <- c(covariate_names, "W", "Y")
This pipeline requires W to be binary and coded either with {0,1} or with {FALSE,TRUE}, representing {control,treatment}.
assert_that(length(unique(W))==2)
## [1] TRUE
assert_that(unique(W) %in% c(0,1) || unique(W) %in% c(FALSE,TRUE))
## [1] TRUE
In certain cases, not all covariates are necessarily confounders. Therefore we specify the set of confounders in confounder_names. Additionally, in some cases there are also variables that are only predictive of the treatment assignment. Their names can be specified in only_treatment_pred_names.
confounder_names <- c("Trauma.center",
"SBP.ph", "DBP.ph", "HR.ph",
"SBP.ph.min", "DBP.ph.min", "HR.ph.max",
"Cardiac.arrest.ph", "Vasopressor.therapy", "SpO2.ph.min",
"AIS.external", "HemoCue.init", "Delta.hemoCue",
"Shock.index.ph", "Cristalloid.volume", "Colloid.volume",
"Activation.HS.procedure")
only_treatment_pred_names <- c()
only_outcome_pred_names <- setdiff(covariate_names,
c(confounder_names, only_treatment_pred_names))
For the DR estimation, we can specify clusters, here we will use the Trauma centers as clusters.
cluster_names <- c("Trauma.center")
We list binary and other categorical variables to ensure correct type casts.
binary_names <- intersect(covariate_names,
c("Cardiac.arrest.ph", "Anticoagulant.therapy",
"Antiplatelet.therapy", "EVD", "Activation.HS.procedure",
"Neurosurgery.day0", "Fi02", "Vasopressor.therapy"))
categorical_names <- intersect(covariate_names,
c("Trauma.center", "Pupil.anomaly",
"Osmotherapy", "Osmotherapy.ph",
"Pupil.anomaly.ph",
"Improv.anomaly.osmo"))
It is possible that during the loading of the data set, certain variables are not cast correctly. We check that the variable types are all correct according to the specified types in binary_names and categorical_names.
Look at the Activation.HS.procedure variable. The treatment TXA is generally given to prevent the occurrence of hemorrhagic shock, a condition which is often fatal. Let us see how is the treatment distributed w.r.t. this variable.
res.table <- table(df.na$W==1, df.na$Activation.HS.procedure,
dnn = c("Treated","Activation.HS.procedure"),
useNA = "ifany")
print(res.table)
## Activation.HS.procedure
## Treated FALSE TRUE <NA>
## FALSE 6939 345 281
## TRUE 263 392 28
First we compute the unadjusted ATE, ignoring the confounders.
ate_raw <- mean(Y[which(as.logical(W))]) - mean(Y[which(!as.logical(W))])
Let us also assess how the outcome is distributed with respect to the treatment group.
if (length(unique(Y))==2){
res.table <- table(as.factor(W), as.factor(Y),
dnn = c("W","Y"))
prop.table(res.table)
} else {
ggplot(data.frame(W=as.factor(W), Y=Y)) +
geom_histogram(aes(Y, color=W), fill = "white")
}
## Y
## W 0 1
## 0 0.75630456 0.16088749
## 1 0.04449564 0.03831232
Among the treated and the control most patients survived. But proportionally, more patients died in the treated group than in the control group. Strong indication for confounding factors that need to be controlled for.
From these two variables we can compute the following observed probabilities:
sprintf("Average outcome: %2.1f", mean(Y))
## [1] "Average outcome: 0.2"
sprintf("P(Treatment=1) : %.2f", mean(W==1))
## [1] "P(Treatment=1) : 0.08"
sprintf("E(Y | Treatment=1) : %.2f", mean(Y*(W==1))/mean(W==1))
## [1] "E(Y | Treatment=1) : 0.46"
sprintf("E(Y | Treatment=0) : %.2f", mean(Y*(W==0))/mean(W==0))
## [1] "E(Y | Treatment=0) : 0.18"
sprintf("ATE without adjustment : %.4f", ate_raw)
## [1] "ATE without adjustment : 0.2873"
These uncorrected empirical quantities suggest that the treatment has a negative impact on the chances of survival.
However, treated patients tend to have lower Glasgow.initial - and are therefore potentially more severe since according to the practitioners this covariate is a predictor for the patient’s severity - which highly suggests that the treatment is not given uniformely at random.
ggplot(data = data.frame(X.na, W=as.factor(W))) +
geom_density(aes(x=GCS.init, color = W, fill = W), alpha = .2)
We impute the data with two different methods (iterative FAMD, mice).
Note that for this approach, an alternative would be to go back to the original dataset, instead of focusing just on the patients with TBI and/or AIS.head >= 2 to impute the missing data. Even though we will only keep the patients with TBI and/or AIS.head>=2 while running the causal analysis, including the other patients at this stage can allow the imputation of missing data to be more informed by adding additional observations for comparison purposes.
Let’s look at the matrix plot to identify if there are any variables that tend to be missing together or if we can detect any clear violation of the missing completely at random hypothesis:
matrixplot(df.na, sortby = 1, las=2, cex.axis = 0.3)
##
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
We can also run MCA on the missing and non missing entries to double check our conclusion and to check if other groups of variables tend to be missing together:
data_miss <- data.frame(is.na(df.na))
data_miss <- apply(X=data_miss, FUN=function(x) if(x) "m" else "o", MARGIN=c(1,2))
res.mca <- MCA(data_miss, graph = FALSE)
plot(res.mca, invis = "ind", title = "MCA graph of the categories", cex =0.5)
We now perform a single imputation with a (regularized) iterative Factorial Analysis for Mixed Data model, which will allow imputation to take into account similarities between both individuals and relationships between variables. See: https://arxiv.org/pdf/1301.4797.pdf
First we find the optimal number of dimensions for the FAMD by cross-validation
Attention: this computation might take a long time. Set eval=F in the chunk header if you want to skip this part.
if (file.exists(paste0(rdata.dir, "traumabase_tbi_ncomp_famd.RData"))) {
load(paste0(rdata.dir, "traumabase_tbi_ncomp_famd.RData"))
plot(1:length(ncomp_famd$criterion), ncomp_famd$criterion, xlab = "nb dim", ylab = "MSEP")
} else {
ncomp_famd <- estim_ncpFAMD(df.na,
ncp.min = 1,
ncp.max=15,
method= "Regularized",
method.cv = "Kfold",
nbsim=10,
verbose = TRUE)
save(ncomp_famd,
file = paste0(rdata.dir, "traumabase_tbi_ncomp_famd.RData"))
}
By cross-validation we find ncp=6 to be the optimal number of components.
if (file.exists(paste0(rdata.dir, "traumabase_tbi_", length(confounder_names),
"confounders_", length(covariate_names),
"variables_imputed_famd.RData"))){
load(file=paste0(rdata.dir, "traumabase_tbi_", length(confounder_names),
"confounders_", length(covariate_names),
"variables_imputed_famd.RData"))
} else {
ncp <- 6
df.imp.pc <- get_FAMD(df.na, seed = 0, ncp=ncp, Method="Regularized",
scale=T, threshold = 1e-06)
save(df.imp.pc, file=paste0(rdata.dir, "traumabase_tbi_",
length(confounder_names),
"confounders_", length(covariate_names),
"variables_imputed_famd.RData"))
}
It is possible that during the loading of the imputed data set, certain variables are not cast correctly. We check that the variable types are all correct according to the specified types in binary_names and categorical_names.
We also prepare a multiple imputation analysis, using the mice package. We impute the data m=5 times.
if (file.exists(paste0(rdata.dir, "traumabase_tbi_", length(confounder_names),
"confounders_", length(covariate_names),
"variables_imputed_mice.RData"))) {
load(file=paste0(rdata.dir, "traumabase_tbi_", length(confounder_names),
"confounders_", length(covariate_names),
"variables_imputed_mice.RData"))
} else {
df.imp.mice <- get_MICE(df.na, seed=0, m=5)
save(df.imp.mice, file=paste0(rdata.dir, "traumabase_tbi_",
length(confounder_names), "confounders_",
length(covariate_names),
"variables_imputed_mice.RData"))
}
Again we check and correct the variable types.
We are interested in estimating the causal effect of TXA on the mortality of TBI patients.
Although most of the patients receive multiple treatments (in pre-hospital and hospital phase) and we have records of these other treatments (for instance osmotherapy, fibrinogene, etc.) we are only considering the effect of TXA on in-hospital mortality.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(only_treatment_pred_names)` instead of `only_treatment_pred_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(confounder_names)` instead of `confounder_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
| Estimand | Imputation.method | PS.estimation | Mask | ATE.normalized | STD.ATE.normalized | ATE.unnormalized |
|---|---|---|---|---|---|---|
| ATE | pc.imp | glm | FALSE | 0.29 | 0.10 | 0.25 |
| ATE | pc.imp | grf.ate | FALSE | 0.17 | 0.04 | 0.09 |
| ATE | mice | glm | FALSE | 0.27 | 0.11 | 0.23 |
| ATE | mice | grf.ate | FALSE | 0.17 | 0.04 | 0.07 |
| ATT | pc.imp | glm | FALSE | 0.30 | 0.03 | 0.00 |
| ATT | pc.imp | grf.ate | FALSE | 0.10 | 0.03 | 0.01 |
| ATT | mice | glm | FALSE | 0.12 | 0.05 | 0.00 |
| ATT | mice | grf.ate | FALSE | 0.11 | 0.03 | 0.01 |
| ATC | pc.imp | glm | FALSE | 0.29 | 0.11 | 0.25 |
| ATC | pc.imp | grf.ate | FALSE | 0.18 | 0.05 | 0.08 |
| ATC | mice | glm | FALSE | 0.29 | 0.12 | 0.23 |
| ATC | mice | grf.ate | FALSE | 0.17 | 0.05 | 0.06 |
| ATE_overlap | pc.imp | glm | FALSE | 0.11 | 0.03 | 0.01 |
| ATE_overlap | pc.imp | grf.ate | FALSE | 0.11 | 0.03 | 0.01 |
| ATE_overlap | mice | glm | FALSE | 0.12 | 0.03 | 0.01 |
| ATE_overlap | mice | grf.ate | FALSE | 0.12 | 0.03 | 0.01 |
| ATE | pc.imp | glm | TRUE | 0.31 | 0.13 | 0.27 |
| ATE | pc.imp | grf.ate | TRUE | 0.17 | 0.04 | 0.08 |
| ATE | mice | glm | TRUE | 0.28 | 0.13 | 0.24 |
| ATE | mice | grf.ate | TRUE | 0.17 | 0.04 | 0.06 |
| ATT | pc.imp | glm | TRUE | 0.28 | 0.04 | 0.00 |
| ATT | pc.imp | grf.ate | TRUE | 0.10 | 0.03 | 0.01 |
| ATT | mice | glm | TRUE | 0.12 | 0.06 | 0.00 |
| ATT | mice | grf.ate | TRUE | 0.11 | 0.03 | 0.01 |
| ATC | pc.imp | glm | TRUE | 0.31 | 0.14 | 0.27 |
| ATC | pc.imp | grf.ate | TRUE | 0.18 | 0.05 | 0.07 |
| ATC | mice | glm | TRUE | 0.30 | 0.14 | 0.25 |
| ATC | mice | grf.ate | TRUE | 0.17 | 0.05 | 0.05 |
| ATE_overlap | pc.imp | glm | TRUE | 0.11 | 0.03 | 0.01 |
| ATE_overlap | pc.imp | grf.ate | TRUE | 0.11 | 0.03 | 0.01 |
| ATE_overlap | mice | glm | TRUE | 0.11 | 0.03 | 0.00 |
| ATE_overlap | mice | grf.ate | TRUE | 0.12 | 0.03 | 0.01 |
| Estimand | Imputation.method | PS.estimation | Mask | ATE.normalized | STD.ATE.normalized | ATE.unnormalized |
|---|---|---|---|---|---|---|
| ATE | mia | grf.ate | TRUE | 0.17 | 0.05 | 0.08 |
| ATT | mia | grf.ate | TRUE | 0.10 | 0.03 | 0.01 |
| ATC | mia | grf.ate | TRUE | 0.17 | 0.05 | 0.07 |
| ATE_overlap | mia | grf.ate | TRUE | 0.11 | 0.03 | 0.01 |
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(confounder_names)` instead of `confounder_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(only_treatment_pred_names)` instead of `only_treatment_pred_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(only_outcome_pred_names)` instead of `only_outcome_pred_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(cluster_names)` instead of `cluster_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
| Estimand | Imputation.method | PS.estimation | Mask | ATE | STD |
|---|---|---|---|---|---|
| ATE | pc.imp | glm.grf | FALSE | 0.15 | 0.07 |
| ATE | pc.imp | grf.ate | FALSE | 0.05 | 0.04 |
| ATE | mice | glm.grf | FALSE | 0.10 | 0.05 |
| ATE | mice | grf.ate | FALSE | 0.03 | 0.04 |
| ATT | pc.imp | glm.grf | FALSE | 0.14 | 0.11 |
| ATT | pc.imp | grf.ate | FALSE | 0.04 | 0.03 |
| ATT | mice | glm.grf | FALSE | 0.03 | 0.03 |
| ATT | mice | grf.ate | FALSE | 0.05 | 0.03 |
| ATC | pc.imp | glm.grf | FALSE | 0.13 | 0.07 |
| ATC | pc.imp | grf.ate | FALSE | 0.05 | 0.04 |
| ATC | mice | glm.grf | FALSE | 0.10 | 0.06 |
| ATC | mice | grf.ate | FALSE | 0.02 | 0.04 |
| ATE_overlap | pc.imp | glm.grf | FALSE | 0.04 | 0.02 |
| ATE_overlap | pc.imp | grf.ate | FALSE | 0.04 | 0.03 |
| ATE_overlap | mice | glm.grf | FALSE | 0.04 | 0.02 |
| ATE_overlap | mice | grf.ate | FALSE | 0.04 | 0.03 |
| ATE | pc.imp | glm.grf | TRUE | 0.16 | 0.08 |
| ATE | pc.imp | grf.ate | TRUE | 0.04 | 0.04 |
| ATE | mice | glm.grf | TRUE | 0.11 | 0.06 |
| ATE | mice | grf.ate | TRUE | 0.03 | 0.04 |
| ATT | pc.imp | glm.grf | TRUE | 0.12 | 0.10 |
| ATT | pc.imp | grf.ate | TRUE | 0.04 | 0.03 |
| ATT | mice | glm.grf | TRUE | 0.02 | 0.03 |
| ATT | mice | grf.ate | TRUE | 0.05 | 0.03 |
| ATC | pc.imp | glm.grf | TRUE | 0.15 | 0.08 |
| ATC | pc.imp | grf.ate | TRUE | 0.05 | 0.04 |
| ATC | mice | glm.grf | TRUE | 0.12 | 0.07 |
| ATC | mice | grf.ate | TRUE | 0.03 | 0.04 |
| ATE_overlap | pc.imp | glm.grf | TRUE | 0.03 | 0.02 |
| ATE_overlap | pc.imp | grf.ate | TRUE | 0.04 | 0.03 |
| ATE_overlap | mice | glm.grf | TRUE | 0.03 | 0.02 |
| ATE_overlap | mice | grf.ate | TRUE | 0.05 | 0.03 |
| Estimand | Imputation.method | PS.estimation | Mask | ATE | STD |
|---|---|---|---|---|---|
| ATE | mia | grf.ate | FALSE | 0.06 | 0.04 |
| ATT | mia | grf.ate | FALSE | 0.04 | 0.03 |
| ATC | mia | grf.ate | FALSE | 0.06 | 0.05 |
| ATE_overlap | mia | grf.ate | FALSE | 0.04 | 0.03 |
| ATE | mia | grf.ate | TRUE | 0.06 | 0.04 |
| ATT | mia | grf.ate | TRUE | 0.04 | 0.03 |
| ATC | mia | grf.ate | TRUE | 0.07 | 0.05 |
| ATE_overlap | mia | grf.ate | TRUE | 0.04 | 0.03 |
We will plot the results for the ATE estimation. Alternatively, you can also plot the results for the other estimands (in the call of the function plot_treatment_effect, set estimand to ATT, ATC or ATE_overlap).
## [1] "GLM"
## [1] "GRF"
The standardized mean differences (SMD) higher than some threshold th, for instance th=0.1, indicate that the covariate distributions in the two groups differ: the treatment is not given uniformly at random. This explains the need for some adjustment or balancing in order to perform a causal analysis of the treatment on a certain outcome.
Note: small SMDs do not necessarily indicate balanced treatment groups, indeed the distributions can differ on other quantities than the first order.
We use propensity scores estimated via the function regression_forest of the grf package.
X.imp.pcWe can also use the previously computed weights to look at the balance of the response pattern.
incomplete_confounders <- confounder_names[which(sapply(X.na[,confounder_names],
function(x) sum(is.na(x))>0))]
R <- data.frame(is.na(X.na[,incomplete_confounders]))
colnames(R) <- paste(incomplete_confounders, "_NA", sep="")
balance <- bal.tab(R, treat=W, estimand="ATE", weights=weights, method = "weighting", un=TRUE)
love.plot(x=balance, stat="mean.diffs", abs=TRUE, var.order="unadjusted",
threshold=0.1, cex=0.8, shapes=c("circle", "triangle"), colors=c(viridis::viridis(10)[3], viridis::viridis(10)[9]),)
bal.plot(x=data.frame(treat=as.logical(W), ps=w.hat, weights=weights), var.name="ps", which="both",
treat=as.logical(W),
weights=weights,
type="histogram", mirror=TRUE)
X.imp.miceX.na using MIA## Warning: Missing values exist in the covariates. Displayed values omit these
## observations.